Skip to content

Conversation

@minimAluminiumalism
Copy link
Contributor

@minimAluminiumalism minimAluminiumalism commented Sep 20, 2025

…ty integration

  • I have added tests that cover my changes.
  • If adding a new instrumentation or changing an existing one, I've added screenshots from some observability platform showing the change.
  • PR name follows conventional commits format: feat(instrumentation): ... or fix(instrumentation): ....
  • (If applicable) I have updated the documentation accordingly.

This PR mainly fixes two issues related to langchain third party integration, specifically the API ChatDeepSeek:

  • Model name detection failing
  • Metrics missing like TTFT and streaming generation time

Repoduce code

import os
import asyncio
from langchain_core.prompts import ChatPromptTemplate

from langchain_deepseek import ChatDeepSeek

# OpenTelemetry setup (HTTP exporter to local collector)
from opentelemetry.sdk.resources import Resource
from opentelemetry.sdk.trace import TracerProvider
from opentelemetry.sdk.trace.export import BatchSpanProcessor
from opentelemetry.exporter.otlp.proto.http.trace_exporter import OTLPSpanExporter

from opentelemetry.instrumentation.langchain import LangchainInstrumentor
from opentelemetry import metrics as otel_metrics
from opentelemetry.sdk.metrics import MeterProvider
from opentelemetry.sdk.metrics.export import ConsoleMetricExporter, PeriodicExportingMetricReader
from opentelemetry import trace as otel_trace

prompt_template = """You are a helpful assistant.
Use the following context to answer briefly.

Context:
{context}

Question:
{question}
"""

def init_otel_and_instrument(service_name: str = "langchain-scratch", collector_endpoint: str = "http://127.0.0.1:4318") -> None:    
    resource = Resource.create({"service.name": service_name})
    
    endpoint = collector_endpoint.rstrip("/") + "/v1/traces"
    trace_exporter = OTLPSpanExporter(endpoint=endpoint)
    tracer_provider = TracerProvider(resource=resource)
    tracer_provider.add_span_processor(BatchSpanProcessor(trace_exporter))
    otel_trace.set_tracer_provider(tracer_provider)

    metric_reader = PeriodicExportingMetricReader(ConsoleMetricExporter(), export_interval_millis=1000)
    meter_provider = MeterProvider(resource=resource, metric_readers=[metric_reader])
    otel_metrics.set_meter_provider(meter_provider)

    langchain_instrumentor = LangchainInstrumentor()
    langchain_instrumentor.instrument(
        tracer_provider=tracer_provider,
        meter_provider=meter_provider
    )
    
    return langchain_instrumentor

async def main():
    instrumentor = init_otel_and_instrument(
        service_name="langchain-scratch", 
        collector_endpoint="http://127.0.0.1:4318"
    )
    
    prompt = ChatPromptTemplate.from_template(prompt_template)

    api_key = os.getenv("OPENAI_API_KEY", "YOUR_API_KEY")
    base_url = os.getenv("OPENAI_BASE_URL", "https://api.deepseek.com/beta")
    model_name = os.getenv("MODEL_NAME", "deepseek-reasoner")

    model = ChatDeepSeek(
        api_base=base_url,
        api_key=api_key,
        model=model_name,
        stream_usage=True,
    )

    chain = prompt | model

    inputs = {"context": "some context", "question": "What's OpenTelemetry?"}
    print("Assistant:", end=" ", flush=True)
    async for chunk in chain.astream(inputs):
        piece = getattr(chunk, "content", "")
        if piece:
            print(piece, end="", flush=True)
    print()
    
    # instrumentor.uninstrument()

if __name__ == "__main__":
    asyncio.run(main())
Clipboard_Screenshot_1758409020

Important

Fixes model name detection and adds missing metrics for Langchain's ChatDeepSeek integration, including TTFT and streaming generation time.

  • Behavior:
    • Fixes model name detection and adds metrics like TTFT and streaming generation time for ChatDeepSeek in callback_handler.py.
    • Adds _create_shared_attributes() in TraceloopCallbackHandler to create shared attributes for metrics.
    • Updates on_llm_new_token() and on_llm_end() to track TTFT and streaming metrics.
  • Metrics:
    • Adds histograms for TTFT and streaming time, counters for generation choices and exceptions in __init__.py.
    • Updates set_request_params() in span_utils.py to enhance model extraction.
  • Tests:
    • Adds tests for model extraction and streaming metrics in test_model_extraction.py, test_streaming_metrics.py, and test_third_party_models.py.
    • Verifies metrics recording in test_langchain_metrics.py.

This description was created by Ellipsis for 15aff50. You can customize this summary. It will automatically update as commits are pushed.

Summary by CodeRabbit

  • New Features
    • Added AI streaming metrics: time-to-first-token, streaming time-to-generate, generation choices, and exception counts; integrated into LLM telemetry with richer model/system attributes.
  • Bug Fixes
    • More reliable model-name detection across requests, responses, metadata and third‑party models; safer lifecycle handling.
  • Tests
    • New suites for streaming metrics, model extraction, third‑party models and an HTTP cassette for streaming.
  • Chores
    • Added third‑party model dependency and enabled test recording plugin.

Copy link
Contributor

@ellipsis-dev ellipsis-dev bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Caution

Changes requested ❌

Reviewed everything up to 15aff50 in 1 minute and 28 seconds. Click for details.
  • Reviewed 1337 lines of code in 7 files
  • Skipped 0 files when reviewing.
  • Skipped posting 6 draft comments. View those below.
  • Modify your settings and rules to customize what types of comments Ellipsis leaves. And don't forget to react with 👍 or 👎 to teach Ellipsis.
1. packages/opentelemetry-instrumentation-langchain/opentelemetry/instrumentation/langchain/__init__.py:72
  • Draft comment:
    New histograms for TTFT and streaming time are added. Consider using a monotonic clock (e.g. time.monotonic()) for duration measurements to avoid issues on systems with clock adjustments, and add inline documentation for these metrics.
  • Reason this comment was not posted:
    Decided after close inspection that this draft comment was likely wrong and/or not actionable: usefulness confidence = 20% vs. threshold = 50% The suggestion about monotonic clock could be valid for precise timing measurements. However, we don't see the actual timing measurement code in this file - it's likely in the callback handler. The documentation suggestion is not needed since there are already clear descriptions in the create_histogram calls. Without seeing the actual timing code, we can't be certain the monotonic clock suggestion is relevant. I might be wrong about the monotonic clock suggestion - it could be a critical best practice for time measurements in distributed systems. The current descriptions might also be insufficient for other developers. While the monotonic clock suggestion might be valid, we can't verify its relevance without seeing the timing implementation. The existing metric descriptions are clear and sufficient. Delete the comment since we can't verify if the monotonic clock suggestion is applicable, and the documentation request is already satisfied by existing descriptions.
2. packages/opentelemetry-instrumentation-langchain/opentelemetry/instrumentation/langchain/callback_handler.py:584
  • Draft comment:
    The on_llm_new_token method correctly tracks TTFT on the first token. Ensure that, in high-concurrency scenarios, potential race conditions are handled if multiple tokens are processed concurrently.
  • Reason this comment was not posted:
    Comment did not seem useful. Confidence is useful = 0% <= threshold 50% The comment is asking the author to ensure that race conditions are handled in high-concurrency scenarios. This falls under asking the author to ensure behavior, which is against the rules. The comment does not provide a specific suggestion or point out a specific issue in the code.
3. packages/opentelemetry-instrumentation-langchain/opentelemetry/instrumentation/langchain/callback_handler.py:668
  • Draft comment:
    The fallback logic for obtaining the model name in on_llm_end is robust. Consider adding a brief inline comment to clarify the priority order of model name extraction for future maintainability.
  • Reason this comment was not posted:
    Comment was not on a location in the diff, so it can't be submitted as a review comment.
4. packages/opentelemetry-instrumentation-langchain/opentelemetry/instrumentation/langchain/span_utils.py:155
  • Draft comment:
    The model inference logic in _infer_model_from_class_name handles known third-party integrations. Consider documenting the expected format of the serialized input to improve maintainability.
  • Reason this comment was not posted:
    Decided after close inspection that this draft comment was likely wrong and/or not actionable: usefulness confidence = 20% vs. threshold = 50% While documenting input formats can be helpful, in this case the function is already well-structured and has clear error handling. The serialized dict format is fairly simple - it just needs a potential "kwargs" field that may contain model fields. The code handles missing or malformed input gracefully by falling back to unknown model names. Additional documentation would be nice-to-have but not critical. The comment has merit since documenting expected input formats is generally good practice. However, the current code handles all edge cases gracefully and the format is fairly straightforward. While documentation could be improved, this seems more like a nice-to-have suggestion rather than a critical issue requiring immediate action. The code is defensive and handles malformed input well. Delete this comment as it suggests a nice-to-have documentation improvement rather than a critical code change. The code already handles inputs robustly.
5. packages/opentelemetry-instrumentation-langchain/tests/test_third_party_models.py:118
  • Draft comment:
    The tests for ChatDeepSeek integration effectively cover both standard extraction and fallback cases. Ensure that if model naming conventions change in the future, these tests are updated accordingly.
  • Reason this comment was not posted:
    Comment did not seem useful. Confidence is useful = 0% <= threshold 50% The comment is purely informative and suggests ensuring future updates, which violates the rules. It doesn't provide a specific code suggestion or ask for a specific test to be written.
6. packages/opentelemetry-instrumentation-langchain/opentelemetry/instrumentation/langchain/span_utils.py:231
  • Draft comment:
    There appears to be an extra closing parenthesis before the '-> None:' in the function signature. Please check if this is a typographical error.
  • Reason this comment was not posted:
    Comment was on unchanged code.

Workflow ID: wflow_bfKB5E9v4FYJHRmv

You can customize Ellipsis by changing your verbosity settings, reacting with 👍 or 👎, replying to comments, or adding code review rules.

self.handler.spans[run_id] = span_holder

# Simulate first token arrival after a small delay
time.sleep(0.1)
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Tests simulate delays using time.sleep(), which can be fragile in CI environments. Consider using a time mocking approach to simulate passage of time in a more stable way.

@gitguardian
Copy link

gitguardian bot commented Sep 21, 2025

️✅ There are no secrets present in this pull request anymore.

If these secrets were true positive and are still valid, we highly recommend you to revoke them.
While these secrets were previously flagged, we no longer have a reference to the
specific commits where they were detected. Once a secret has been leaked into a git
repository, you should consider it compromised, even if it was deleted immediately.
Find here more information about risks.


🦉 GitGuardian detects secrets in your source code to help developers and security teams secure the modern development process. You are seeing this because you or someone else with access to this repository has authorized GitGuardian to scan your pull request.

@coderabbitai
Copy link

coderabbitai bot commented Sep 21, 2025

Note

Other AI code review bot(s) detected

CodeRabbit has detected other AI code review bot(s) in this pull request and will avoid duplicating their findings in the review comments. This may lead to a less comprehensive review.

Walkthrough

Adds GenAI incubating metrics (TTFT, streaming time, choices counter, exception counter) to LangChain instrumentation, wires them through instrumentor and TraceloopCallbackHandler, enhances model-name extraction and span metadata handling, adds DeepSeek dependency and streaming test cassette, and expands tests for streaming/third‑party models and metrics.

Changes

Cohort / File(s) Summary
Instrumentation init & wiring
packages/opentelemetry-instrumentation-langchain/opentelemetry/instrumentation/langchain/__init__.py
Imports GenAIMetrics; creates TTFT and streaming-time histograms and choices/exception counters; passes them into TraceloopCallbackHandler during instrumentation setup.
Callback handler metrics & flow
packages/opentelemetry-instrumentation-langchain/opentelemetry/instrumentation/langchain/callback_handler.py
Extends TraceloopCallbackHandler.__init__ to accept ttft_histogram, streaming_time_histogram, choices_counter, exception_counter; adds on_llm_new_token; centralizes shared metric attributes; records TTFT, streaming duration, choices, and exception metrics; updates error handling and model-name resolution fallbacks.
Span/model extraction utilities
packages/opentelemetry-instrumentation-langchain/opentelemetry/instrumentation/langchain/span_utils.py
Adds first_token_time to SpanHolder; implements unified model extraction functions and fallbacks (_get_unified_unknown_model, _extract_model_name_from_request, _infer_model_from_class_name, _extract_model_name_from_association_metadata); updates set_request_params, set_llm_request, set_chat_request, and response-model extraction to use serialized/metadata inputs.
Project configuration
packages/opentelemetry-instrumentation-langchain/pyproject.toml
Adds dependency langchain-deepseek = "^0.1.4" to main and test dependency groups.
Test config
packages/opentelemetry-instrumentation-langchain/tests/conftest.py
Enables pytest plugin pytest_recording via pytest_plugins.
Metrics tests
packages/opentelemetry-instrumentation-langchain/tests/metrics/test_langchain_metrics.py
Imports ERROR_TYPE and GenAIMetrics; expands assertions to validate GEN_AI_SERVER_TIME_TO_FIRST_TOKEN, LLM_STREAMING_TIME_TO_GENERATE, choices and exception metrics and related attributes.
Streaming cassette
packages/opentelemetry-instrumentation-langchain/tests/metrics/cassettes/test_langchain_metrics/test_streaming_with_ttft_and_generation_time_metrics.yaml
Adds DeepSeek streaming interaction cassette with chunked data: lines and final data: [DONE] response.
Streaming unit tests
packages/opentelemetry-instrumentation-langchain/tests/test_streaming_metrics.py
New test suite TestStreamingMetrics validating TTFT recording on first token, no-repeat TTFT, choices counting, streaming-time, and exception metrics using mocks.
Third‑party model tests
packages/opentelemetry-instrumentation-langchain/tests/test_third_party_models.py
New tests verifying model extraction from serialized kwargs for DeepSeek, fallback deepseek-unknown, and correct metric attribution on on_llm_end.
Model extraction tests
packages/opentelemetry-instrumentation-langchain/tests/test_model_extraction.py
New tests covering multi-path model extraction (kwargs, invocation_params, serialized, metadata), class-name inference, association metadata, and response metadata parsing.

Sequence Diagram(s)

sequenceDiagram
  autonumber
  participant App
  participant LangChain
  participant Callback as TraceloopCallbackHandler
  participant Span as Span/SpanHolder
  participant OTel as OTel Metrics

  App->>LangChain: invoke LLM/chat (streaming)
  LangChain->>Callback: on_chat_model_start / on_llm_start(kwargs, serialized, metadata)
  Callback->>Span: set_request_params(span_holder, kwargs, serialized, metadata)
  Note right of Callback: resolve model_name via kwargs/serialized/metadata/class-name

  loop stream tokens
    LangChain->>Callback: on_llm_new_token(token, run_id)
    alt first token
      Callback->>Span: record first_token_time
      Callback->>OTel: ttft_histogram.record(value, attributes)
    else subsequent token
      Callback-->>OTel: (no TTFT)
    end
  end

  LangChain->>Callback: on_llm_end(result, run_id)
  Callback->>OTel: duration_histogram.record(...)
  Callback->>OTel: token_histogram.record(...)
  opt streaming
    Callback->>OTel: streaming_time_histogram.record(...)
    Callback->>OTel: choices_counter.add(n, attributes)
  end
  Callback->>Span: end span & cleanup
Loading
sequenceDiagram
  autonumber
  participant LangChain
  participant Callback
  participant Span
  participant OTel as OTel Metrics

  LangChain->>Callback: on_llm_error(error, run_id)
  Callback->>Span: resolve model_name & error type
  Callback->>OTel: exception_counter.add(1, attributes: error.type, model, system, server.address)
  Callback-->>LangChain: propagate/return
Loading

Estimated code review effort

🎯 4 (Complex) | ⏱️ ~60 minutes

Possibly related PRs

Suggested reviewers

  • nirga
  • doronkopit5

Poem

A nibble of tokens, the first one sweet,
My whiskers twitch—TTFT’s complete!
I count the choices as streams flow by,
If errors hop up, I mark them high.
DeepSeek dreams in metrics neat. 🐇✨

Pre-merge checks and finishing touches

❌ Failed checks (1 warning)
Check name Status Explanation Resolution
Docstring Coverage ⚠️ Warning Docstring coverage is 77.27% which is insufficient. The required threshold is 80.00%. You can run @coderabbitai generate docstrings to improve docstring coverage.
✅ Passed checks (2 passed)
Check name Status Explanation
Description Check ✅ Passed Check skipped - CodeRabbit’s high-level summary is enabled.
Title Check ✅ Passed The title accurately and concisely describes the primary change—fixing missing span attributes and metrics for the LangChain third‑party integration—which matches the modifications to span_utils.py, callback_handler.py, and the added metrics/tests; it uses a conventional commit prefix and avoids unnecessary detail. The wording is slightly awkward grammatically but remains specific and on-topic for a reviewer scanning PR history. Overall the title communicates the main intent of the changeset.
✨ Finishing touches
  • 📝 Generate Docstrings
🧪 Generate unit tests
  • Create PR with unit tests
  • Post copyable unit tests in a comment

Tip

👮 Agentic pre-merge checks are now available in preview!

Pro plan users can now enable pre-merge checks in their settings to enforce checklists before merging PRs.

  • Built-in checks – Quickly apply ready-made checks to enforce title conventions, require pull request descriptions that follow templates, validate linked issues for compliance, and more.
  • Custom agentic checks – Define your own rules using CodeRabbit’s advanced agentic capabilities to enforce organization-specific policies and workflows. For example, you can instruct CodeRabbit’s agent to verify that API documentation is updated whenever API schema files are modified in a PR. Note: Upto 5 custom checks are currently allowed during the preview period. Pricing for this feature will be announced in a few weeks.

Please see the documentation for more information.

Example:

reviews:
  pre_merge_checks:
    custom_checks:
      - name: "Undocumented Breaking Changes"
        mode: "warning"
        instructions: |
          Pass/fail criteria: All breaking changes to public APIs, CLI flags, environment variables, configuration keys, database schemas, or HTTP/GraphQL endpoints must be documented in the "Breaking Change" section of the PR description and in CHANGELOG.md. Exclude purely internal or private changes (e.g., code not exported from package entry points or explicitly marked as internal).

Please share your feedback with us on this Discord post.


Thanks for using CodeRabbit! It's free for OSS, and your support helps us grow. If you like it, consider giving us a shout-out.

❤️ Share

Comment @coderabbitai help to get the list of available commands and usage tips.

Copy link

@coderabbitai coderabbitai bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Actionable comments posted: 4

Caution

Some comments are outside the diff and can’t be posted inline due to platform limitations.

⚠️ Outside diff range comments (2)
packages/opentelemetry-instrumentation-langchain/tests/conftest.py (1)

150-174: Harden cassette scrubbing: handle bytes/JSON, nested secrets, and response bodies; expand header/query filtering.

Reduce risk of leaking keys in recorded cassettes and make body-type handling robust.

Apply:

@@
 def vcr_config():
-    def before_record_request(request):
-        if hasattr(request, "body") and request.body:
-            import json
-
-            try:
-                if isinstance(request.body, (str, bytes)):
-                    body_str = (
-                        request.body.decode("utf-8")
-                        if isinstance(request.body, bytes)
-                        else request.body
-                    )
-                    body_data = json.loads(body_str)
-                    if "api_key" in body_data:
-                        body_data["api_key"] = "FILTERED"
-                        request.body = json.dumps(body_data)
-            except (json.JSONDecodeError, UnicodeDecodeError, AttributeError):
-                pass
-        return request
+    def before_record_request(request):
+        import json, re
+        SENSITIVE_KEYS = {"api_key", "apikey", "key", "token", "access_token", "authorization", "password"}
+
+        def _scrub(obj):
+            if isinstance(obj, dict):
+                for k, v in list(obj.items()):
+                    if k.lower() in SENSITIVE_KEYS:
+                        obj[k] = "FILTERED"
+                    else:
+                        _scrub(v)
+            elif isinstance(obj, list):
+                for i in obj:
+                    _scrub(i)
+
+        if hasattr(request, "body") and request.body:
+            body = request.body
+            try:
+                body_str = body.decode("utf-8") if isinstance(body, (bytes, bytearray)) else str(body)
+                try:
+                    data = json.loads(body_str)
+                    _scrub(data)
+                    body_scrubbed = json.dumps(data)
+                except json.JSONDecodeError:
+                    # Fallback: redact bearer/API keys in raw text
+                    body_scrubbed = re.sub(r'(?i)(bearer\s+)[A-Za-z0-9._-]+', r'\1FILTERED', body_str)
+                    body_scrubbed = re.sub(r'(?i)("?(api_?key|token|access_token)"?\s*:\s*)"[^"]+"', r'\1"FILTERED"', body_scrubbed)
+                request.body = body_scrubbed.encode("utf-8") if isinstance(body, (bytes, bytearray)) else body_scrubbed
+            except (UnicodeDecodeError, AttributeError, TypeError):
+                pass
+        return request
+
+    def before_record_response(response):
+        # VCRpy response is a dict-like structure
+        import json, re
+        try:
+            body = response.get("body", {}).get("string")
+            if not body:
+                return response
+            body_str = body.decode("utf-8") if isinstance(body, (bytes, bytearray)) else str(body)
+            try:
+                data = json.loads(body_str)
+                # Best-effort scrub
+                def _scrub(obj):
+                    if isinstance(obj, dict):
+                        for k in list(obj.keys()):
+                            if str(k).lower() in {"api_key", "apikey", "key", "token", "access_token", "authorization", "password"}:
+                                obj[k] = "FILTERED"
+                            else:
+                                _scrub(obj[k])
+                    elif isinstance(obj, list):
+                        for i in obj:
+                            _scrub(i)
+                _scrub(data)
+                body_scrubbed = json.dumps(data)
+            except json.JSONDecodeError:
+                body_scrubbed = re.sub(r'(?i)(bearer\s+)[A-Za-z0-9._-]+', r'\1FILTERED', body_str)
+            response["body"]["string"] = body_scrubbed.encode("utf-8") if isinstance(body, (bytes, bytearray)) else body_scrubbed
+        except Exception:
+            pass
+        return response
@@
-    return {
-        "filter_headers": ["authorization", "x-api-key"],
-        "match_on": ["method", "scheme", "host", "port", "path", "query"],
-        "before_record_request": before_record_request,
-    }
+    return {
+        "filter_headers": ["authorization", "Authorization", "x-api-key", "X-API-KEY"],
+        "filter_query_parameters": ["api_key", "key", "token", "access_token"],
+        "match_on": ["method", "scheme", "host", "port", "path", "query"],
+        "before_record_request": before_record_request,
+        "before_record_response": before_record_response,
+    }
packages/opentelemetry-instrumentation-langchain/opentelemetry/instrumentation/langchain/span_utils.py (1)

334-341: Incorrect type check: use isinstance instead of identity comparison

Comparison to the str type is always False; causes JSON-quoted strings in completions.

-                if generation.message.content is str:
+                if isinstance(generation.message.content, str):
🧹 Nitpick comments (11)
packages/opentelemetry-instrumentation-langchain/tests/conftest.py (5)

93-95: Uninstrument in reverse order of instrumentation.

Reduces chance of teardown interdependencies.

-    openai_instrumentor.uninstrument()
-    langchain_instrumentor.uninstrument()
-    bedrock_instrumentor.uninstrument()
+    bedrock_instrumentor.uninstrument()
+    langchain_instrumentor.uninstrument()
+    openai_instrumentor.uninstrument()

88-90: Pass meter_provider to Bedrock instrumentor too.

Keeps metrics pipeline consistent across instrumentors.

-    bedrock_instrumentor = BedrockInstrumentor()
-    bedrock_instrumentor.instrument(tracer_provider=tracer_provider)
+    bedrock_instrumentor = BedrockInstrumentor()
+    bedrock_instrumentor.instrument(tracer_provider=tracer_provider, meter_provider=meter_provider)

68-73: Set explicit service.name on the MeterProvider resource.

Helps disambiguate telemetry from parallel test runs.

-    resource = Resource.create()
+    resource = Resource.create({"service.name": "langchain-instrumentation-tests"})

137-147: Add DEEPSEEK_API_KEY default for third‑party tests.

Prevents accidental skipping/failure if DeepSeek-dependent tests are added.

     if not os.environ.get("TAVILY_API_KEY"):
         os.environ["TAVILY_API_KEY"] = "test"
+    if not os.environ.get("DEEPSEEK_API_KEY"):
+        os.environ["DEEPSEEK_API_KEY"] = "test"

106-109: Remove unused local variable in fixtures.

Minor cleanup; yield the session instrumentor directly.

-    instrumentor = instrument_legacy
-
-    yield instrumentor
+    yield instrument_legacy
-    instrumentor = instrument_legacy
-
-    yield instrumentor
+    yield instrument_legacy

Also applies to: 123-126

packages/opentelemetry-instrumentation-langchain/opentelemetry/instrumentation/langchain/__init__.py (1)

72-99: New streaming and completion metrics properly configured.

The new metrics (TTFT histogram, streaming time histogram, choices counter, and exception counter) are correctly configured with appropriate names, units, and descriptions. The metric names align with OpenTelemetry semantic conventions.

One minor inconsistency: the exception counter uses a custom name "llm.langchain.completions.exceptions" instead of following the pattern from existing Meters constants.

Consider using a constant from the Meters class for consistency:

-        exception_counter = meter.create_counter(
-            name="llm.langchain.completions.exceptions",
+        exception_counter = meter.create_counter(
+            name="llm.langchain.completions.exceptions",  # Consider adding to Meters class
packages/opentelemetry-instrumentation-langchain/tests/metrics/test_langchain_metrics.py (1)

412-450: Exception metrics testing handles error scenarios appropriately.

The test properly validates that exception metrics are recorded when LLM calls fail, including correct error type attribution.

However, consider the static analysis suggestion about blind exception handling:

The test could be improved by catching a more specific exception type instead of the generic Exception:

-        except Exception:
-            pass  # Expected to fail
+        except Exception as e:
+            # Expected to fail - verify it's the mocked exception
+            assert "API Error" in str(e)
packages/opentelemetry-instrumentation-langchain/opentelemetry/instrumentation/langchain/callback_handler.py (1)

583-622: New token callback enables TTFT tracking.

The on_llm_new_token method properly implements TTFT tracking by recording the time when the first token arrives. The model name resolution with fallbacks ensures reliable attribution.

However, there are several unused method arguments flagged by static analysis:

Consider using underscore prefixes for intentionally unused parameters:

 def on_llm_new_token(
     self,
-    token: str,
+    _token: str,
     *,
-    chunk: Optional[Union[GenerationChunk, ChatGenerationChunk]] = None,
+    chunk: Optional[Union[GenerationChunk, ChatGenerationChunk]] = None,  # Keep for future use
     run_id: UUID,
-    parent_run_id: Optional[UUID] = None,
+    _parent_run_id: Optional[UUID] = None,
-    **kwargs: Any,
+    **_kwargs: Any,
 ) -> Any:
packages/opentelemetry-instrumentation-langchain/opentelemetry/instrumentation/langchain/span_utils.py (3)

90-93: Remove unused span_holder argument from _extract_model_name_from_request

It isn’t used. Trim the param and update the call to reduce coupling and silence ARG001.

-def _extract_model_name_from_request(
-    kwargs, span_holder: SpanHolder, serialized: Optional[dict] = None, metadata: Optional[dict] = None
-) -> str:
+def _extract_model_name_from_request(
+    kwargs, serialized: Optional[dict] = None, metadata: Optional[dict] = None
+) -> str:
@@
-    model = _extract_model_name_from_request(kwargs, span_holder, serialized, metadata)
+    model = _extract_model_name_from_request(kwargs, serialized, metadata)

Also applies to: 181-183


129-134: Avoid bare except/quiet pass around association_properties

Narrow the exception; silent broad catches hinder debugging and trigger BLE001/S110.

-    except Exception:
-        pass
+    except (ImportError, AttributeError):
+        # Best-effort lookup; ignore missing context API or attribute
+        pass

136-147: Clean up dict key checks (RUF019) and simplify access

Use dict.get to reduce branching and improve clarity.

-    if serialized:
-        if "kwargs" in serialized:
-            ser_kwargs = serialized["kwargs"]
-            for model_tag in ("model", "model_id", "model_name"):
-                if (model := ser_kwargs.get(model_tag)) is not None:
-                    return model
+    if serialized:
+        ser_kwargs = serialized.get("kwargs", {})
+        for model_tag in ("model", "model_id", "model_name"):
+            if (model := ser_kwargs.get(model_tag)) is not None:
+                return model
@@
-        if "id" in serialized and serialized["id"]:
-            class_name = serialized["id"][-1] if isinstance(serialized["id"], list) else str(serialized["id"])
+        sid = serialized.get("id")
+        if sid:
+            class_name = sid[-1] if isinstance(sid, list) else str(sid)
             return _infer_model_from_class_name(class_name, serialized)
-    if "ChatDeepSeek" in class_name:
-        # Check if serialized contains the actual model name
-        if "kwargs" in serialized:
-            ser_kwargs = serialized["kwargs"]
+    if "ChatDeepSeek" in class_name:
+        # Check if serialized contains the actual model name
+        ser_kwargs = serialized.get("kwargs", {})
         # ChatDeepSeek might store model in different fields
         for model_field in ("model", "_model", "model_name"):
             if model_field in ser_kwargs and ser_kwargs[model_field]:
                 return ser_kwargs[model_field]

Also applies to: 148-151, 161-166

📜 Review details

Configuration used: CodeRabbit UI

Review profile: CHILL

Plan: Pro

Disabled knowledge base sources:

  • Linear integration is disabled by default for public repositories

You can enable these sources in your CodeRabbit configuration.

📥 Commits

Reviewing files that changed from the base of the PR and between fd9626b and c69c758.

⛔ Files ignored due to path filters (1)
  • packages/opentelemetry-instrumentation-langchain/poetry.lock is excluded by !**/*.lock
📒 Files selected for processing (10)
  • packages/opentelemetry-instrumentation-langchain/opentelemetry/instrumentation/langchain/__init__.py (2 hunks)
  • packages/opentelemetry-instrumentation-langchain/opentelemetry/instrumentation/langchain/callback_handler.py (8 hunks)
  • packages/opentelemetry-instrumentation-langchain/opentelemetry/instrumentation/langchain/span_utils.py (5 hunks)
  • packages/opentelemetry-instrumentation-langchain/pyproject.toml (1 hunks)
  • packages/opentelemetry-instrumentation-langchain/tests/conftest.py (1 hunks)
  • packages/opentelemetry-instrumentation-langchain/tests/metrics/cassettes/test_langchain_metrics/test_streaming_with_ttft_and_generation_time_metrics.yaml (1 hunks)
  • packages/opentelemetry-instrumentation-langchain/tests/metrics/test_langchain_metrics.py (3 hunks)
  • packages/opentelemetry-instrumentation-langchain/tests/test_model_extraction.py (1 hunks)
  • packages/opentelemetry-instrumentation-langchain/tests/test_streaming_metrics.py (1 hunks)
  • packages/opentelemetry-instrumentation-langchain/tests/test_third_party_models.py (1 hunks)
🧰 Additional context used
📓 Path-based instructions (2)
**/*.py

📄 CodeRabbit inference engine (CLAUDE.md)

**/*.py: Store API keys only in environment variables/secure vaults; never hardcode secrets in code
Use Flake8 for code linting and adhere to its rules

Files:

  • packages/opentelemetry-instrumentation-langchain/tests/conftest.py
  • packages/opentelemetry-instrumentation-langchain/tests/test_model_extraction.py
  • packages/opentelemetry-instrumentation-langchain/opentelemetry/instrumentation/langchain/__init__.py
  • packages/opentelemetry-instrumentation-langchain/tests/test_streaming_metrics.py
  • packages/opentelemetry-instrumentation-langchain/tests/test_third_party_models.py
  • packages/opentelemetry-instrumentation-langchain/opentelemetry/instrumentation/langchain/callback_handler.py
  • packages/opentelemetry-instrumentation-langchain/tests/metrics/test_langchain_metrics.py
  • packages/opentelemetry-instrumentation-langchain/opentelemetry/instrumentation/langchain/span_utils.py
**/cassettes/**/*.{yaml,yml,json}

📄 CodeRabbit inference engine (CLAUDE.md)

Never commit secrets or PII in VCR cassettes; scrub sensitive data

Files:

  • packages/opentelemetry-instrumentation-langchain/tests/metrics/cassettes/test_langchain_metrics/test_streaming_with_ttft_and_generation_time_metrics.yaml
🧬 Code graph analysis (6)
packages/opentelemetry-instrumentation-langchain/tests/test_model_extraction.py (1)
packages/opentelemetry-instrumentation-langchain/opentelemetry/instrumentation/langchain/span_utils.py (4)
  • _extract_model_name_from_request (90-152)
  • _infer_model_from_class_name (155-172)
  • extract_model_name_from_response_metadata (474-505)
  • SpanHolder (27-37)
packages/opentelemetry-instrumentation-langchain/opentelemetry/instrumentation/langchain/__init__.py (2)
packages/opentelemetry-semantic-conventions-ai/opentelemetry/semconv_ai/__init__.py (1)
  • Meters (36-61)
packages/opentelemetry-instrumentation-langchain/opentelemetry/instrumentation/langchain/callback_handler.py (1)
  • TraceloopCallbackHandler (147-1031)
packages/opentelemetry-instrumentation-langchain/tests/test_streaming_metrics.py (3)
packages/opentelemetry-instrumentation-langchain/opentelemetry/instrumentation/langchain/callback_handler.py (4)
  • TraceloopCallbackHandler (147-1031)
  • on_llm_new_token (584-621)
  • on_llm_end (624-748)
  • on_llm_error (881-890)
packages/opentelemetry-instrumentation-langchain/opentelemetry/instrumentation/langchain/span_utils.py (1)
  • SpanHolder (27-37)
packages/opentelemetry-semantic-conventions-ai/opentelemetry/semconv_ai/__init__.py (1)
  • SpanAttributes (64-261)
packages/opentelemetry-instrumentation-langchain/tests/test_third_party_models.py (3)
packages/opentelemetry-instrumentation-langchain/opentelemetry/instrumentation/langchain/callback_handler.py (3)
  • TraceloopCallbackHandler (147-1031)
  • on_chat_model_start (522-550)
  • on_llm_end (624-748)
packages/opentelemetry-instrumentation-langchain/opentelemetry/instrumentation/langchain/span_utils.py (2)
  • SpanHolder (27-37)
  • _extract_model_name_from_request (90-152)
packages/opentelemetry-semantic-conventions-ai/opentelemetry/semconv_ai/__init__.py (1)
  • SpanAttributes (64-261)
packages/opentelemetry-instrumentation-langchain/opentelemetry/instrumentation/langchain/callback_handler.py (2)
packages/opentelemetry-instrumentation-langchain/opentelemetry/instrumentation/langchain/span_utils.py (7)
  • SpanHolder (27-37)
  • set_request_params (175-222)
  • set_llm_request (225-245)
  • _get_unified_unknown_model (58-87)
  • _set_span_attribute (53-55)
  • set_chat_response_usage (387-471)
  • set_chat_response (313-384)
packages/opentelemetry-instrumentation-langchain/opentelemetry/instrumentation/langchain/utils.py (1)
  • dont_throw (55-77)
packages/opentelemetry-instrumentation-langchain/tests/metrics/test_langchain_metrics.py (3)
packages/opentelemetry-semantic-conventions-ai/opentelemetry/semconv_ai/__init__.py (2)
  • Meters (36-61)
  • SpanAttributes (64-261)
packages/opentelemetry-instrumentation-langchain/tests/conftest.py (1)
  • instrument_legacy (77-95)
packages/opentelemetry-instrumentation-milvus/tests/conftest.py (1)
  • reader (37-41)
🪛 Ruff (0.13.1)
packages/opentelemetry-instrumentation-langchain/opentelemetry/instrumentation/langchain/callback_handler.py

171-171: PEP 484 prohibits implicit Optional

Convert to Optional[T]

(RUF013)


193-194: try-except-pass detected, consider logging the exception

(S110)


193-193: Do not catch blind exception: Exception

(BLE001)


586-586: Unused method argument: token

(ARG002)


588-588: Unused method argument: chunk

(ARG002)


590-590: Unused method argument: parent_run_id

(ARG002)


591-591: Unused method argument: kwargs

(ARG002)

packages/opentelemetry-instrumentation-langchain/tests/metrics/test_langchain_metrics.py

309-309: Unused function argument: instrument_legacy

(ARG001)


412-412: Unused function argument: instrument_legacy

(ARG001)


429-430: try-except-pass detected, consider logging the exception

(S110)


429-429: Do not catch blind exception: Exception

(BLE001)

packages/opentelemetry-instrumentation-langchain/opentelemetry/instrumentation/langchain/span_utils.py

58-58: PEP 484 prohibits implicit Optional

Convert to Optional[T]

(RUF013)


58-58: PEP 484 prohibits implicit Optional

Convert to Optional[T]

(RUF013)


91-91: Unused function argument: span_holder

(ARG001)


133-134: try-except-pass detected, consider logging the exception

(S110)


133-133: Do not catch blind exception: Exception

(BLE001)


148-148: Unnecessary key check before dictionary access

Replace with dict.get

(RUF019)


165-165: Unnecessary key check before dictionary access

Replace with dict.get

(RUF019)

🔇 Additional comments (27)
packages/opentelemetry-instrumentation-langchain/tests/conftest.py (1)

29-29: pytest-recording declared in test deps — no action required.

Confirmed pytest-recording = "^0.13.2" in packages/opentelemetry-instrumentation-langchain/pyproject.toml and pytest_plugins = ["pytest_recording"] in packages/opentelemetry-instrumentation-langchain/tests/conftest.py.

packages/opentelemetry-instrumentation-langchain/pyproject.toml (2)

57-57: LGTM!

The addition of langchain-deepseek ^0.1.4 as a test dependency is appropriate for supporting the third-party model integration tests in this PR.


57-57: Keep dependency as-is — langchain-deepseek ^0.1.4 is available on PyPI.
Verified via pip index (available versions include 0.1.4); no change required.

packages/opentelemetry-instrumentation-langchain/opentelemetry/instrumentation/langchain/__init__.py (2)

20-20: Import added for new AI metrics.

The import of GenAIMetrics from the incubating metrics module enables access to standardized metric names for TTFT tracking.


107-109: New metrics properly wired into callback handler.

The expanded constructor call correctly passes all six metric instruments to TraceloopCallbackHandler, enabling the streaming TTFT and completion metrics functionality.

packages/opentelemetry-instrumentation-langchain/tests/test_third_party_models.py (4)

12-30: Setup method properly initializes all required mock metrics.

The test setup correctly mocks all six metric instruments that are now required by the updated TraceloopCallbackHandler constructor.


31-66: ChatDeepSeek model extraction test is comprehensive.

The test properly validates model extraction from serialized kwargs, verifies span attributes are set correctly, and confirms the request_model is captured in the SpanHolder.


74-93: Fallback behavior correctly tested.

The test confirms that when no model is provided in serialized kwargs, the system falls back to "deepseek-unknown" as expected.


108-117: Token metrics validation ensures correct model attribution.

The test properly verifies that token usage metrics use the correct model name from the request fallback, ensuring consistent attribution across the instrumentation.

packages/opentelemetry-instrumentation-langchain/tests/metrics/test_langchain_metrics.py (3)

8-9: New imports support extended metrics validation.

The imports of ERROR_TYPE and GenAIMetrics enable testing of the new streaming metrics and exception handling.


126-161: Streaming metrics validation is thorough.

The test properly validates that generation choices, TTFT, and streaming time metrics are recorded with correct attributes and values during streaming operations.


308-410: ChatDeepSeek streaming test validates third-party integration fixes.

The test comprehensively validates that the ChatDeepSeek integration properly records all streaming metrics (TTFT, generation time, choices) with correct model name attribution. The test confirms the fixes for third-party model handling work as expected.

packages/opentelemetry-instrumentation-langchain/tests/test_model_extraction.py (3)

15-27: Test setup properly initializes SpanHolder.

The setup method correctly creates a SpanHolder instance for testing model extraction functions.


46-73: DeepSeek model extraction tests cover key scenarios.

The tests properly validate both successful model extraction from serialized kwargs and fallback behavior when model information is missing.


127-147: Priority order testing ensures correct precedence.

The test confirms that model extraction follows the correct priority order (kwargs > invocation_params > nested structures), which is crucial for reliable model identification.

packages/opentelemetry-instrumentation-langchain/tests/test_streaming_metrics.py (4)

14-33: Test setup correctly mocks all required metrics instruments.

The setup properly initializes all six metric mocks needed for the updated callback handler.


54-56: Time-based testing approach with potential CI fragility.

The test uses time.sleep(0.1) to simulate delays, which could be unreliable in CI environments with varying system load.

Consider using time mocking for more stable tests:

from unittest.mock import patch
import time

@patch('time.time')
def test_ttft_metric_recorded_on_first_token(self, mock_time):
    # Control time progression explicitly
    mock_time.side_effect = [start_time, start_time + 0.1, start_time + 0.15]
    # ... rest of test

34-67: TTFT metric test properly validates first token timing.

The test correctly verifies that the TTFT metric is recorded when the first token arrives and includes proper attribute validation.


131-168: Streaming time metric test validates end-to-end timing.

The test properly simulates the complete streaming flow from first token to completion and validates that streaming time is calculated correctly.

packages/opentelemetry-instrumentation-langchain/opentelemetry/instrumentation/langchain/callback_handler.py (5)

57-57: Counter import added for new metrics.

The import enables usage of Counter metrics alongside existing Histogram metrics.


148-157: Constructor signature expanded for new metrics.

The constructor correctly accepts all six metric instruments needed for comprehensive LLM observability, including TTFT, streaming time, choices, and exception metrics.


663-676: Enhanced model name resolution with comprehensive fallbacks.

The improved model name resolution ensures that a valid model name is always available for metric attribution, preventing None values from breaking downstream processing.


706-747: Metrics recording uses shared attributes for consistency.

The updated metrics recording in on_llm_end properly uses the shared attributes helper to ensure consistent attribution across all metrics. The streaming time calculation and choices counting are correctly implemented.


861-877: Exception metrics properly track error events.

The enhanced error handling in _handle_error correctly records exception metrics with proper model attribution and error type information.

packages/opentelemetry-instrumentation-langchain/tests/metrics/cassettes/test_langchain_metrics/test_streaming_with_ttft_and_generation_time_metrics.yaml (1)

1-220: Test cassette correctly captures DeepSeek streaming — no secrets found.

grep matched only 'token' occurrences in the response "usage" fields (prompt_tokens/completion_tokens); no api_key, secret, authorization/bearer header, or other credentials found in headers or body.

packages/opentelemetry-instrumentation-langchain/opentelemetry/instrumentation/langchain/span_utils.py (2)

37-37: New SpanHolder.first_token_time: confirm monotonic clock source

For TTFT correctness, ensure producers set this using a monotonic clock (time.perf_counter() or time.monotonic()), and that start_time uses the same clock when computing deltas.


54-55: Attribute guard looks good

Rejecting None/empty-string values before setting span attributes is correct.

Comment on lines +475 to 506
"""Enhanced model name extraction from response metadata with third-party support."""

# Standard extraction from response metadata
for generations in response.generations:
for generation in generations:
if (
getattr(generation, "message", None)
and getattr(generation.message, "response_metadata", None)
and (model_name := generation.message.response_metadata.get("model_name"))
):
metadata = generation.message.response_metadata
# Try multiple possible model name fields
for model_field in ("model_name", "model", "model_id"):
if (model_name := metadata.get(model_field)):
return model_name

# Enhanced extraction for third-party models
# Check if llm_output contains model information
if response.llm_output:
for model_field in ("model", "model_name", "model_id"):
if (model_name := response.llm_output.get(model_field)):
return model_name

# Check generation_info for model information
for generations in response.generations:
for generation in generations:
if hasattr(generation, "generation_info") and generation.generation_info:
for model_field in ("model", "model_name", "model_id"):
if (model_name := generation.generation_info.get(model_field)):
return model_name

return None

Copy link

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

🛠️ Refactor suggestion

🧩 Analysis chain

Type mismatch: function returns None but annotated as str

Change return type to Optional[str] to reflect behavior and prevent downstream assumptions.

-def extract_model_name_from_response_metadata(response: LLLMResult) -> str:
+def extract_model_name_from_response_metadata(response: LLMResult) -> Optional[str]:

Run to check callsites expecting str:


🏁 Script executed:

#!/bin/bash
rg -nP -C2 '\bextract_model_name_from_response_metadata\s*\('

Length of output: 3287


Change return type to Optional[str]

Function can return None when no model field is found — update the signature to:
def extract_model_name_from_response_metadata(response: LLMResult) -> Optional[str]
and add "from typing import Optional".
Call sites found: packages/opentelemetry-instrumentation-langchain/tests/test_model_extraction.py (asserts returned string) and packages/opentelemetry-instrumentation-langchain/opentelemetry/instrumentation/langchain/callback_handler.py (already checks for None). No other callers found.

🤖 Prompt for AI Agents
In
packages/opentelemetry-instrumentation-langchain/opentelemetry/instrumentation/langchain/span_utils.py
around lines 475 to 506, the function can return None but the signature
currently does not reflect that; update the signature to def
extract_model_name_from_response_metadata(response: LLMResult) -> Optional[str],
add from typing import Optional at the top of the file, and run/update the test
at
packages/opentelemetry-instrumentation-langchain/tests/test_model_extraction.py
to accept an Optional[str] return (adjust assertions accordingly) so callers
that already handle None continue to work.

Copy link

@coderabbitai coderabbitai bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Actionable comments posted: 1

Caution

Some comments are outside the diff and can’t be posted inline due to platform limitations.

⚠️ Outside diff range comments (3)
packages/opentelemetry-instrumentation-langchain/opentelemetry/instrumentation/langchain/callback_handler.py (2)

677-715: Guard token_usage math against None to avoid TypeError.

If a field exists with value 0, the current “or”-chain can yield None and break addition.

-        if token_usage is not None:
-            prompt_tokens = (
-                token_usage.get("prompt_tokens")
-                or token_usage.get("input_token_count")
-                or token_usage.get("input_tokens")
-            )
-            completion_tokens = (
-                token_usage.get("completion_tokens")
-                or token_usage.get("generated_token_count")
-                or token_usage.get("output_tokens")
-            )
-            total_tokens = token_usage.get("total_tokens") or (
-                prompt_tokens + completion_tokens
-            )
+        if token_usage is not None:
+            def _first_present(d, keys):
+                for k in keys:
+                    v = d.get(k)
+                    if v is not None:
+                        return v
+                return 0
+            prompt_tokens = int(_first_present(token_usage, ("prompt_tokens", "input_token_count", "input_tokens")))
+            completion_tokens = int(_first_present(token_usage, ("completion_tokens", "generated_token_count", "output_tokens")))
+            total_tokens = int(token_usage.get("total_tokens") or (prompt_tokens + completion_tokens))

270-288: Avoid private _RUNTIME_CONTEXT; use public detach and suppress expected errors.

Private API may change across OTel versions.

-        try:
-            # Use the runtime context directly to avoid error logging from context_api.detach()
-            from opentelemetry.context import _RUNTIME_CONTEXT
-
-            _RUNTIME_CONTEXT.detach(token)
-        except Exception:
-            # Context detach can fail in async scenarios when tokens are created in different contexts
-            # This includes ValueError, RuntimeError, and other context-related exceptions
-            # This is expected behavior and doesn't affect the correct span hierarchy
-            #
-            # Common scenarios where this happens:
-            # 1. Token created in one async task/thread, detached in another
-            # 2. Context was already detached by another process
-            # 3. Token became invalid due to context switching
-            # 4. Race conditions in highly concurrent scenarios
-            #
-            # This is safe to ignore as the span itself was properly ended
-            # and the tracing data is correctly captured.
-            pass
+        from contextlib import suppress
+        with suppress(Exception):
+            context_api.detach(token)
packages/opentelemetry-instrumentation-langchain/opentelemetry/instrumentation/langchain/span_utils.py (1)

315-373: Bug: type check uses is str; should use isinstance. Also guard generation_info access.

Prevents misclassification and AttributeError.

-                if generation.message.content is str:
+                if isinstance(generation.message.content, str):
@@
-                if generation.generation_info.get("finish_reason"):
+                if getattr(generation, "generation_info", {}) and generation.generation_info.get("finish_reason"):
                     _set_span_attribute(
🧹 Nitpick comments (6)
packages/opentelemetry-instrumentation-langchain/tests/test_streaming_metrics.py (1)

72-97: Optional: mock time here too for consistency.

This test only checks call count, but mocking time like other tests keeps timing fully deterministic across the suite.

packages/opentelemetry-instrumentation-langchain/opentelemetry/instrumentation/langchain/callback_handler.py (3)

170-207: Type: avoid implicit Optional and add light debug on metadata lookup failures.

operation_type should be Optional[str]; keep narrow excepts but don’t silently swallow everything forever.

-def _create_shared_attributes(
-        self, span, model_name: str, operation_type: str = None, is_streaming: bool = False
-    ) -> dict:
+def _create_shared_attributes(
+        self, span, model_name: str, operation_type: Optional[str] = None, is_streaming: bool = False
+    ) -> dict:
@@
-        except (AttributeError, KeyError, TypeError):
-            pass
+        except (AttributeError, KeyError, TypeError):
+            # Expected when context/association_properties is unavailable
+            server_address = None

583-622: Silence unused-arg warnings without breaking LangChain’s callback API; move import to module scope.

Keep parameter names to retain compatibility, but mark them as intentionally unused and avoid per-call import.

-from opentelemetry.instrumentation.langchain.span_utils import _get_unified_unknown_model
+from opentelemetry.instrumentation.langchain.span_utils import _get_unified_unknown_model
@@
-    def on_llm_new_token(
+    def on_llm_new_token(
         self,
-        token: str,
+        token: str,  # noqa: ARG002
         *,
-        chunk: Optional[Union[GenerationChunk, ChatGenerationChunk]] = None,
+        chunk: Optional[Union[GenerationChunk, ChatGenerationChunk]] = None,  # noqa: ARG002
         run_id: UUID,
-        parent_run_id: Optional[UUID] = None,
-        **kwargs: Any,
+        parent_run_id: Optional[UUID] = None,  # noqa: ARG002
+        **kwargs: Any,  # noqa: ARG002
     ) -> Any:
@@
-            from opentelemetry.instrumentation.langchain.span_utils import _get_unified_unknown_model
-
             model_name = (

637-676: Include 'model' in the first llm_output extraction to avoid unnecessary fallbacks.

Saves a pass when providers set llm_output["model"].

-        if response.llm_output is not None:
-            model_name = response.llm_output.get(
-                "model_name"
-            ) or response.llm_output.get("model_id")
+        if response.llm_output is not None:
+            model_name = (
+                response.llm_output.get("model_name")
+                or response.llm_output.get("model_id")
+                or response.llm_output.get("model")
+            )
packages/opentelemetry-instrumentation-langchain/opentelemetry/instrumentation/langchain/span_utils.py (2)

58-90: Case-insensitive class-name heuristics.

Lowercase class_name to avoid missing vendor due to casing differences.

-    # Fallback to class name-based inference
-    if class_name:
-        if "ChatDeepSeek" in class_name:
+    # Fallback to class name-based inference
+    if class_name:
+        class_lower = class_name.lower()
+        if "chatdeepseek" in class_lower:
             return "deepseek-unknown"
-        elif "ChatOpenAI" in class_name:
+        elif "chatopenai" in class_lower:
             return "gpt-unknown"
-        elif "ChatAnthropic" in class_name:
+        elif "chatanthropic" in class_lower:
             return "claude-unknown"
-        elif "ChatCohere" in class_name:
+        elif "chatcohere" in class_lower:
             return "command-unknown"
-        elif "ChatOllama" in class_name:
+        elif "chatollama" in class_lower:
             return "ollama-unknown"

92-155: _extract_model_name_from_request: remove unused param and simplify dict access.

span_holder isn’t used; replace key checks with .get to satisfy Ruff and readability.

-def _extract_model_name_from_request(
-    kwargs, span_holder: SpanHolder, serialized: Optional[dict] = None, metadata: Optional[dict] = None
-) -> str:
+def _extract_model_name_from_request(
+    kwargs, serialized: Optional[dict] = None, metadata: Optional[dict] = None
+) -> str:
@@
-    if "kwargs" in kwargs:
-        nested_kwargs = kwargs["kwargs"]
+    nested_kwargs = kwargs.get("kwargs") or {}
     for model_tag in ("model", "model_id", "model_name"):
-        if (model := nested_kwargs.get(model_tag)) is not None:
+        if (model := nested_kwargs.get(model_tag)) is not None:
             return model
@@
-    if serialized:
-        if "kwargs" in serialized:
-            ser_kwargs = serialized["kwargs"]
+    if serialized:
+        ser_kwargs = serialized.get("kwargs") or {}
         for model_tag in ("model", "model_id", "model_name"):
-            if (model := ser_kwargs.get(model_tag)) is not None:
+            if (model := ser_kwargs.get(model_tag)) is not None:
                 return model
@@
-        if "id" in serialized and serialized["id"]:
-            class_name = serialized["id"][-1] if isinstance(serialized["id"], list) else str(serialized["id"])
+        if serialized.get("id"):
+            class_name = serialized["id"][-1] if isinstance(serialized["id"], list) else str(serialized["id"])
             return _infer_model_from_class_name(class_name, serialized)

And update callers:

-    model = _extract_model_name_from_request(kwargs, span_holder, serialized, metadata)
+    model = _extract_model_name_from_request(kwargs, serialized, metadata)
📜 Review details

Configuration used: CodeRabbit UI

Review profile: CHILL

Plan: Pro

Disabled knowledge base sources:

  • Linear integration is disabled by default for public repositories

You can enable these sources in your CodeRabbit configuration.

📥 Commits

Reviewing files that changed from the base of the PR and between c69c758 and 7227a1b.

📒 Files selected for processing (4)
  • packages/opentelemetry-instrumentation-langchain/opentelemetry/instrumentation/langchain/callback_handler.py (9 hunks)
  • packages/opentelemetry-instrumentation-langchain/opentelemetry/instrumentation/langchain/span_utils.py (5 hunks)
  • packages/opentelemetry-instrumentation-langchain/tests/test_model_extraction.py (1 hunks)
  • packages/opentelemetry-instrumentation-langchain/tests/test_streaming_metrics.py (1 hunks)
🚧 Files skipped from review as they are similar to previous changes (1)
  • packages/opentelemetry-instrumentation-langchain/tests/test_model_extraction.py
🧰 Additional context used
📓 Path-based instructions (1)
**/*.py

📄 CodeRabbit inference engine (CLAUDE.md)

**/*.py: Store API keys only in environment variables/secure vaults; never hardcode secrets in code
Use Flake8 for code linting and adhere to its rules

Files:

  • packages/opentelemetry-instrumentation-langchain/tests/test_streaming_metrics.py
  • packages/opentelemetry-instrumentation-langchain/opentelemetry/instrumentation/langchain/callback_handler.py
  • packages/opentelemetry-instrumentation-langchain/opentelemetry/instrumentation/langchain/span_utils.py
🧬 Code graph analysis (2)
packages/opentelemetry-instrumentation-langchain/tests/test_streaming_metrics.py (2)
packages/opentelemetry-instrumentation-langchain/opentelemetry/instrumentation/langchain/callback_handler.py (4)
  • TraceloopCallbackHandler (147-1030)
  • on_llm_new_token (584-621)
  • on_llm_end (624-747)
  • on_llm_error (880-889)
packages/opentelemetry-instrumentation-langchain/opentelemetry/instrumentation/langchain/span_utils.py (1)
  • SpanHolder (27-37)
packages/opentelemetry-instrumentation-langchain/opentelemetry/instrumentation/langchain/callback_handler.py (3)
packages/opentelemetry-instrumentation-langchain/opentelemetry/instrumentation/langchain/span_utils.py (4)
  • SpanHolder (27-37)
  • set_request_params (177-224)
  • _get_unified_unknown_model (58-89)
  • _set_span_attribute (53-55)
packages/opentelemetry-semantic-conventions-ai/opentelemetry/semconv_ai/__init__.py (1)
  • SpanAttributes (64-261)
packages/opentelemetry-instrumentation-langchain/opentelemetry/instrumentation/langchain/utils.py (2)
  • should_emit_events (80-87)
  • dont_throw (55-77)
🪛 Ruff (0.13.1)
packages/opentelemetry-instrumentation-langchain/opentelemetry/instrumentation/langchain/callback_handler.py

171-171: PEP 484 prohibits implicit Optional

Convert to Optional[T]

(RUF013)


586-586: Unused method argument: token

(ARG002)


588-588: Unused method argument: chunk

(ARG002)


590-590: Unused method argument: parent_run_id

(ARG002)


591-591: Unused method argument: kwargs

(ARG002)

packages/opentelemetry-instrumentation-langchain/opentelemetry/instrumentation/langchain/span_utils.py

93-93: Unused function argument: span_holder

(ARG001)


150-150: Unnecessary key check before dictionary access

Replace with dict.get

(RUF019)


167-167: Unnecessary key check before dictionary access

Replace with dict.get

(RUF019)

⏰ Context from checks skipped due to timeout of 90000ms. You can increase the timeout in your CodeRabbit configuration to a maximum of 15 minutes (900000ms). (5)
  • GitHub Check: Build Packages (3.11)
  • GitHub Check: Test Packages (3.12)
  • GitHub Check: Test Packages (3.11)
  • GitHub Check: Test Packages (3.10)
  • GitHub Check: Lint
🔇 Additional comments (10)
packages/opentelemetry-instrumentation-langchain/tests/test_streaming_metrics.py (4)

34-71: Good: time is mocked; resolves prior flakiness concern.

Mocking time in both callback_handler and span_utils modules stabilizes TTFT assertions. This addresses the earlier CI fragility raised about time.sleep().


98-134: Choices metric coverage looks good.

Checks count and core attrs; no issues.


183-213: Exception metric assertion is correct.

Covers error.type and model fallback.


215-242: Good negative test for streaming metric.

Ensures no record without first token.

packages/opentelemetry-instrumentation-langchain/opentelemetry/instrumentation/langchain/callback_handler.py (3)

546-547: Fix validated: set_request_params now receives real kwargs.

This resolves the earlier bug where serialized['kwargs'] dropped invocation_params.

Also applies to: 576-577


846-876: Exception metric path: solid.

Includes vendor/model attrs and error.type; aligns with tests.


147-166: Constructor surface expansion — call sites updated.
All TraceloopCallbackHandler(...) instantiations pass the seven parameters in the correct order: tracer, duration_histogram, token_histogram, ttft_histogram, streaming_time_histogram, choices_counter, exception_counter.
Verified locations:

  • packages/opentelemetry-instrumentation-langchain/tests/test_third_party_models.py:21–23
  • packages/opentelemetry-instrumentation-langchain/tests/test_streaming_metrics.py:24–26
  • packages/opentelemetry-instrumentation-langchain/opentelemetry/instrumentation/langchain/init.py:106–108
packages/opentelemetry-instrumentation-langchain/opentelemetry/instrumentation/langchain/span_utils.py (3)

37-37: SpanHolder: first_token_time addition LGTM.

Enables TTFT/streaming metrics.


476-508: Return type corrected to Optional[str].

Matches behavior and downstream checks.


510-514: Association metadata extractor: concise and correct.

No issues.

Comment on lines +135 to +181
def test_streaming_time_to_generate_metric(self):
"""Test that streaming time to generate metric is recorded."""
run_id = uuid4()
mock_span = Mock(spec=Span)
mock_span.attributes = {SpanAttributes.LLM_SYSTEM: "Langchain"}

with patch('opentelemetry.instrumentation.langchain.callback_handler.time.time') as mock_time, \
patch('opentelemetry.instrumentation.langchain.span_utils.time.time') as mock_span_time:

start_time = 1000.0
mock_time.return_value = start_time
mock_span_time.return_value = start_time

span_holder = SpanHolder(
span=mock_span,
token=None,
context=None,
children=[],
workflow_name="test",
entity_name="test",
entity_path="test",
start_time=start_time
)
self.handler.spans[run_id] = span_holder

first_token_time = start_time + 0.05
mock_time.return_value = first_token_time
mock_span_time.return_value = first_token_time
self.handler.on_llm_new_token("Hello", run_id=run_id)

completion_time = first_token_time + 0.05
mock_time.return_value = completion_time
mock_span_time.return_value = completion_time
llm_result = LLMResult(
generations=[[Generation(text="Hello world")]],
llm_output={"model_name": "test-model"}
)

self.handler.on_llm_end(llm_result, run_id=run_id)

self.streaming_time_histogram.record.assert_called_once()
args = self.streaming_time_histogram.record.call_args
streaming_time = args[0][0]
assert abs(streaming_time - 0.05) < 0.001, (
f"Streaming time should be approximately 0.05 seconds, "
f"got {streaming_time}"
)
Copy link

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

🧹 Nitpick

Also assert streaming flag in metric attributes.

Verify attributes include stream=True to ensure streaming path is exercised.

Apply this minimal assertion:

@@
-            self.streaming_time_histogram.record.assert_called_once()
-            args = self.streaming_time_histogram.record.call_args
+            self.streaming_time_histogram.record.assert_called_once()
+            args = self.streaming_time_histogram.record.call_args
             streaming_time = args[0][0]
@@
-            )
+            )
+            assert args[1]["attributes"].get("stream") is True
📝 Committable suggestion

‼️ IMPORTANT
Carefully review the code before committing. Ensure that it accurately replaces the highlighted code, contains no missing lines, and has no issues with indentation. Thoroughly test & benchmark the code to ensure it meets the requirements.

Suggested change
def test_streaming_time_to_generate_metric(self):
"""Test that streaming time to generate metric is recorded."""
run_id = uuid4()
mock_span = Mock(spec=Span)
mock_span.attributes = {SpanAttributes.LLM_SYSTEM: "Langchain"}
with patch('opentelemetry.instrumentation.langchain.callback_handler.time.time') as mock_time, \
patch('opentelemetry.instrumentation.langchain.span_utils.time.time') as mock_span_time:
start_time = 1000.0
mock_time.return_value = start_time
mock_span_time.return_value = start_time
span_holder = SpanHolder(
span=mock_span,
token=None,
context=None,
children=[],
workflow_name="test",
entity_name="test",
entity_path="test",
start_time=start_time
)
self.handler.spans[run_id] = span_holder
first_token_time = start_time + 0.05
mock_time.return_value = first_token_time
mock_span_time.return_value = first_token_time
self.handler.on_llm_new_token("Hello", run_id=run_id)
completion_time = first_token_time + 0.05
mock_time.return_value = completion_time
mock_span_time.return_value = completion_time
llm_result = LLMResult(
generations=[[Generation(text="Hello world")]],
llm_output={"model_name": "test-model"}
)
self.handler.on_llm_end(llm_result, run_id=run_id)
self.streaming_time_histogram.record.assert_called_once()
args = self.streaming_time_histogram.record.call_args
streaming_time = args[0][0]
assert abs(streaming_time - 0.05) < 0.001, (
f"Streaming time should be approximately 0.05 seconds, "
f"got {streaming_time}"
)
def test_streaming_time_to_generate_metric(self):
"""Test that streaming time to generate metric is recorded."""
run_id = uuid4()
mock_span = Mock(spec=Span)
mock_span.attributes = {SpanAttributes.LLM_SYSTEM: "Langchain"}
with patch('opentelemetry.instrumentation.langchain.callback_handler.time.time') as mock_time, \
patch('opentelemetry.instrumentation.langchain.span_utils.time.time') as mock_span_time:
start_time = 1000.0
mock_time.return_value = start_time
mock_span_time.return_value = start_time
span_holder = SpanHolder(
span=mock_span,
token=None,
context=None,
children=[],
workflow_name="test",
entity_name="test",
entity_path="test",
start_time=start_time
)
self.handler.spans[run_id] = span_holder
first_token_time = start_time + 0.05
mock_time.return_value = first_token_time
mock_span_time.return_value = first_token_time
self.handler.on_llm_new_token("Hello", run_id=run_id)
completion_time = first_token_time + 0.05
mock_time.return_value = completion_time
mock_span_time.return_value = completion_time
llm_result = LLMResult(
generations=[[Generation(text="Hello world")]],
llm_output={"model_name": "test-model"}
)
self.handler.on_llm_end(llm_result, run_id=run_id)
self.streaming_time_histogram.record.assert_called_once()
args = self.streaming_time_histogram.record.call_args
streaming_time = args[0][0]
assert abs(streaming_time - 0.05) < 0.001, (
f"Streaming time should be approximately 0.05 seconds, "
f"got {streaming_time}"
)
assert args[1]["attributes"].get("stream") is True
🤖 Prompt for AI Agents
In
packages/opentelemetry-instrumentation-langchain/tests/test_streaming_metrics.py
around lines 135 to 181, the test verifies the recorded streaming time but does
not assert that the metric attributes indicate streaming; add a minimal
assertion after retrieving the histogram.record call args to check that the
recorded metric's attributes include stream=True (i.e., extract the attributes
from the record call kwargs and assert attributes.get("stream") is True) so the
test verifies the streaming path is exercised.

@minimAluminiumalism
Copy link
Contributor Author

@nirga Please review this PR related to the LangChain instrumentation bugs.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

1 participant